This library finds the best performing configuration from a set of dimensions (e.g. schemas, partition, storage), which can be specified inside the settings.yaml file of the resource.¶

In [ ]:
%pip install PAPyA==0.1.0

Load the configuration file and log file locations for the experiment¶

Configurations for SP2Bench Data

In [1]:
config_sp2bench = "settings.yaml" # config file location
logs_sp2bench = "log" # logs file location

Configurations for Watdiv Data

In [2]:
config_watdiv = "settings_watdiv.yaml" # config file location
logs_watdiv = "log_watdiv" # logs file location

Configuration file

The configuration file is written in YAML and has two main parts: the dimensions and the number of queries in the experiment. You can add more dimensions here or change the existing ones to anything you need.

Example :

dimensions:
    schemas: ["st", "vt", "pt"]
    partition: ["horizontal", "subject"]
    storage: ["parquet", "orc"]

query: 20
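Once parsed (e.g. with PyYAML's `yaml.safe_load`), the dimensions expand into the full configuration space by taking one option per dimension. A minimal sketch, with the parsed result inlined as a plain dict so it is self-contained:

```python
from itertools import product

# The dimensions as they would look after parsing settings.yaml
# (e.g. via yaml.safe_load); inlined here to keep the sketch self-contained.
dimensions = {
    "schemas": ["st", "vt", "pt"],
    "partition": ["horizontal", "subject"],
    "storage": ["parquet", "orc"],
}

# One configuration = one option per dimension, joined in the declared order.
configs = [".".join(combo) for combo in product(*dimensions.values())]
# 3 schemas x 2 partitions x 2 storage formats = 12 configurations
```

Adding an option to any dimension multiplies the configuration space accordingly.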

Log file structure

The log file names must follow the order of dimensions in the configuration file (i.e. {schemas}.{partition}.{storage}.txt), and the subfolders should be the ranking sets of the experiment (i.e. dataset sizes).

Example :

UI Module
└───log
    │
    |───100M
    |    │   st.horizontal.csv.txt
    |    │   st.horizontal.avro.txt
    |    │   ...
    │
    └───250M
        |   st.horizontal.csv.txt
        │   st.horizontal.avro.txt
        │   ...
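A small helper (hypothetical, not part of PAPyA) can check that a ranking-set subfolder contains every log file the configuration implies:

```python
from itertools import product
from pathlib import Path

def missing_logs(log_dir, dimensions, ranking_set):
    """Return the expected log files absent from log_dir/ranking_set.

    File names follow the dimension order of the configuration file,
    i.e. {schemas}.{partition}.{storage}.txt.
    """
    expected = {".".join(combo) + ".txt" for combo in product(*dimensions.values())}
    present = {p.name for p in Path(log_dir, ranking_set).glob("*.txt")}
    return sorted(expected - present)
```

For example, `missing_logs("log_watdiv", dims, "100M")` returns an empty list when the layout is complete.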

Single Dimensional Ranking¶

SDRank is a class in the PAPyA library that calculates a ranking score R for each dimension independently, operating over the log structure the user specified in the configuration file.
The value of R reflects the performance of a particular configuration (a higher value means a better performing configuration). We use the Ranking Function R below to calculate the rank scores:

$$R = \sum \limits _{r=1} ^{d} \frac{O_{dim} \cdot (d-r)}{|Q| \cdot (d-1)}, \quad 0 \le R \le 1$$

$d$ : total number of parameters (options) in a particular dimension
$O_{dim}$ : number of occurrences of the dimension's option placed at rank $r$ (Rank 1, Rank 2, Rank 3, ...)
$|Q|$ : total number of queries
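As a sketch (not the library's internal code), the formula can be written directly from the occurrence counts. For instance, with $d = 3$ schema options, 20 queries, and a configuration placed 17 times at rank 1 and 3 times at rank 2, $R = (17 \cdot 2 + 3 \cdot 1)/(20 \cdot 2) = 0.925$:

```python
def ranking_score(occurrences, num_queries):
    """Ranking function R: occurrences[r-1] is how often a configuration
    landed at rank r (r = 1 is best); d = len(occurrences) is the number
    of options in the dimension."""
    d = len(occurrences)
    return sum(o * (d - r) for r, o in enumerate(occurrences, start=1)) / (
        num_queries * (d - 1)
    )

# 17 first places, 3 second, 0 third over 20 queries -> (34 + 3) / 40
print(ranking_score([17, 3, 0], 20))  # 0.925
```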

PAPyA.Rank.SDRank¶

class Rank.SDRank(config_path, log_path, ranking_sets, dimension)¶

Parameters:
  config_path : str
  Specify the path to your configuration file. i.e. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log files. i.e. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice. i.e. dataset sizes (100M)
  dimension : str
  A single dimension to be ranked. i.e. schemas

In [3]:
# this class takes a single dimension and a dataset size as parameters, matching the user's log file structure
from Rank import SDRank

schemaSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'schemas')
schemaSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'schemas')
schemaSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'schemas')
partitionSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'partition')
partitionSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'partition')
partitionSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'partition')
storageSDRank_100M = SDRank(config_watdiv, logs_watdiv, '100M', 'storage')
storageSDRank_250M = SDRank(config_watdiv, logs_watdiv, '250M', 'storage')
storageSDRank_500M = SDRank(config_watdiv, logs_watdiv, '500M', 'storage')

Rank.SDRank.calculateRank¶

SDRank.calculateRank(*args)¶

Automates the calculation of the rank scores of a single dimension using the Ranking Function above.

Returns a table of configurations sorted by Ranking Score, from best performing to worst, along with the number of occurrences of the dimension's options at each rank _r_ (1st, 2nd, 3rd, ...)

Parameters:
  *args : str or list
  This method takes an arbitrary number of string and list arguments.
   str -> slice the table according to the string input. i.e. "predicate" will slice the table by predicate partitioning
   list -> remove queries from the ranking calculation. i.e. [7,8,9] will remove queries 7, 8, and 9 from the calculation
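The slicing and query-removal behavior described above can be mimicked on plain data. This sketch is illustrative only; the configuration names and runtimes are hypothetical, not PAPyA internals:

```python
# Hypothetical per-configuration runtimes, queries keyed "Q1".."Q3".
runtimes = {
    "st.predicate.parquet": {"Q1": 1.2, "Q2": 0.8, "Q3": 2.0},
    "st.subject.parquet":   {"Q1": 1.5, "Q2": 0.9, "Q3": 1.7},
}

def apply_args(data, *args):
    """str -> keep only configurations containing that option;
    list -> drop the listed query numbers before ranking."""
    out = data
    for arg in args:
        if isinstance(arg, str):
            out = {c: q for c, q in out.items() if arg in c.split(".")}
        else:
            drop = {f"Q{n}" for n in arg}
            out = {c: {k: v for k, v in q.items() if k not in drop}
                   for c, q in out.items()}
    return out

sliced = apply_args(runtimes, "predicate", [3])  # one config left, queries Q1 and Q2
```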

In [4]:
schemaSDRank_100M.calculateRank()
Out[4]:
Rank 1 Rank 2 Rank 3 Result
pt.horizontal.parquet 17.0 3.0 0.0 0.925
pt.horizontal.orc 15.0 3.0 2.0 0.825
pt.subject.parquet 15.0 3.0 2.0 0.825
pt.subject.orc 11.0 7.0 2.0 0.725
st.subject.orc 7.0 8.0 5.0 0.550
vp.horizontal.parquet 3.0 13.0 4.0 0.475
st.subject.parquet 3.0 11.0 6.0 0.425
vp.horizontal.orc 3.0 8.0 9.0 0.350
st.horizontal.orc 2.0 9.0 9.0 0.325
vp.subject.parquet 2.0 6.0 12.0 0.250
vp.subject.orc 2.0 5.0 13.0 0.225
st.horizontal.parquet 0.0 4.0 16.0 0.100
In [5]:
schemaSDRank_250M.calculateRank()
Out[5]:
Rank 1 Rank 2 Rank 3 Result
pt.horizontal.parquet 17.0 3.0 0.0 0.925
pt.horizontal.orc 16.0 4.0 0.0 0.900
pt.subject.parquet 16.0 4.0 0.0 0.900
pt.subject.orc 14.0 6.0 0.0 0.850
vp.horizontal.orc 4.0 15.0 1.0 0.575
vp.horizontal.parquet 3.0 17.0 0.0 0.575
vp.subject.parquet 3.0 12.0 5.0 0.450
vp.subject.orc 4.0 8.0 8.0 0.400
st.subject.orc 2.0 6.0 12.0 0.250
st.subject.parquet 1.0 4.0 15.0 0.150
st.horizontal.orc 0.0 1.0 19.0 0.025
st.horizontal.parquet 0.0 0.0 20.0 0.000
In [6]:
schemaSDRank_500M.calculateRank()
Out[6]:
Rank 1 Rank 2 Rank 3 Result
pt.subject.parquet 16.0 4.0 0.0 0.900
pt.horizontal.parquet 15.0 5.0 0.0 0.875
pt.horizontal.orc 14.0 6.0 0.0 0.850
pt.subject.orc 13.0 6.0 1.0 0.800
vp.horizontal.orc 6.0 14.0 0.0 0.650
vp.horizontal.parquet 5.0 15.0 0.0 0.625
vp.subject.orc 6.0 13.0 1.0 0.625
vp.subject.parquet 4.0 16.0 0.0 0.600
st.subject.orc 1.0 1.0 18.0 0.075
st.horizontal.orc 0.0 0.0 20.0 0.000
st.horizontal.parquet 0.0 0.0 20.0 0.000
st.subject.parquet 0.0 0.0 20.0 0.000
In [7]:
partitionSDRank_100M.calculateRank()
Out[7]:
Rank 1 Rank 2 Result
st.subject.orc 20.0 0.0 1.00
st.subject.parquet 20.0 0.0 1.00
pt.horizontal.orc 13.0 7.0 0.65
vp.subject.orc 13.0 7.0 0.65
vp.horizontal.parquet 12.0 8.0 0.60
pt.subject.parquet 11.0 9.0 0.55
pt.horizontal.parquet 9.0 11.0 0.45
vp.subject.parquet 8.0 12.0 0.40
pt.subject.orc 7.0 13.0 0.35
vp.horizontal.orc 7.0 13.0 0.35
st.horizontal.orc 0.0 20.0 0.00
st.horizontal.parquet 0.0 20.0 0.00
In [8]:
partitionSDRank_250M.calculateRank()
Out[8]:
Rank 1 Rank 2 Result
st.subject.orc 20.0 0.0 1.00
st.subject.parquet 20.0 0.0 1.00
pt.horizontal.parquet 14.0 6.0 0.70
vp.horizontal.parquet 14.0 6.0 0.70
vp.subject.orc 11.0 9.0 0.55
pt.horizontal.orc 10.0 10.0 0.50
pt.subject.orc 10.0 10.0 0.50
vp.horizontal.orc 9.0 11.0 0.45
pt.subject.parquet 6.0 14.0 0.30
vp.subject.parquet 6.0 14.0 0.30
st.horizontal.orc 0.0 20.0 0.00
st.horizontal.parquet 0.0 20.0 0.00
In [9]:
partitionSDRank_500M.calculateRank()
Out[9]:
Rank 1 Rank 2 Result
st.subject.orc 20.0 0.0 1.00
st.subject.parquet 20.0 0.0 1.00
pt.subject.orc 15.0 5.0 0.75
pt.subject.parquet 15.0 5.0 0.75
vp.horizontal.parquet 13.0 7.0 0.65
vp.subject.orc 11.0 9.0 0.55
vp.horizontal.orc 9.0 11.0 0.45
vp.subject.parquet 7.0 13.0 0.35
pt.horizontal.orc 5.0 15.0 0.25
pt.horizontal.parquet 5.0 15.0 0.25
st.horizontal.orc 0.0 20.0 0.00
st.horizontal.parquet 0.0 20.0 0.00
In [10]:
storageSDRank_100M.calculateRank()
Out[10]:
Rank 1 Rank 2 Result
st.subject.orc 19.0 1.0 0.95
st.horizontal.orc 16.0 4.0 0.80
vp.horizontal.parquet 13.0 7.0 0.65
vp.subject.orc 12.0 8.0 0.60
pt.horizontal.parquet 11.0 9.0 0.55
pt.subject.parquet 11.0 9.0 0.55
pt.horizontal.orc 9.0 11.0 0.45
pt.subject.orc 9.0 11.0 0.45
vp.subject.parquet 8.0 12.0 0.40
vp.horizontal.orc 7.0 13.0 0.35
st.horizontal.parquet 4.0 16.0 0.20
st.subject.parquet 1.0 19.0 0.05
In [11]:
storageSDRank_250M.calculateRank()
Out[11]:
Rank 1 Rank 2 Result
st.subject.orc 19.0 1.0 0.95
st.horizontal.orc 18.0 2.0 0.90
pt.horizontal.parquet 15.0 5.0 0.75
vp.subject.orc 15.0 5.0 0.75
vp.horizontal.orc 11.0 9.0 0.55
pt.subject.orc 10.0 10.0 0.50
pt.subject.parquet 10.0 10.0 0.50
vp.horizontal.parquet 9.0 11.0 0.45
pt.horizontal.orc 5.0 15.0 0.25
vp.subject.parquet 5.0 15.0 0.25
st.horizontal.parquet 2.0 18.0 0.10
st.subject.parquet 1.0 19.0 0.05
In [12]:
storageSDRank_500M.calculateRank()
Out[12]:
Rank 1 Rank 2 Result
st.horizontal.orc 20.0 0.0 1.00
st.subject.orc 20.0 0.0 1.00
pt.horizontal.orc 15.0 5.0 0.75
vp.subject.orc 15.0 5.0 0.75
pt.subject.orc 12.0 8.0 0.60
vp.horizontal.orc 11.0 9.0 0.55
vp.horizontal.parquet 9.0 11.0 0.45
pt.subject.parquet 8.0 12.0 0.40
pt.horizontal.parquet 5.0 15.0 0.25
vp.subject.parquet 5.0 15.0 0.25
st.horizontal.parquet 0.0 20.0 0.00
st.subject.parquet 0.0 20.0 0.00

Rank.SDRank.plotRadar¶

SDRank.plotRadar()¶

Ranking over one dimension is insufficient when multiple dimensions matter. The presence of trade-offs reduces the accuracy of single dimension ranking functions, as can be seen in the radar plot below.

This method returns a radar chart showing the trade-offs of the single dimension ranking criterion, which reduces accuracy in the other dimensions.

In [13]:
# These figures show that the top configuration when ranking by one dimension (e.g. schemas) is optimized towards that dimension only, ignoring the other two dimensions.
from Rank import SDRank
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotRadar()
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotRadar()
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotRadar()
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plotRadar()
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plotRadar()
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plotRadar()
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plotRadar()
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plotRadar()
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plotRadar()

Rank.SDRank.plot¶

SDRank.plot(view)¶

In addition to the radar plot, PAPyA provides a visualization that shows the performance of a chosen single dimension's parameters in terms of their rank scores.

This method returns a bar chart of a particular dimension's rankings, pivoting over another dimension according to the user's chosen viewing projection.

Parameters:
view : str
  The dimensional option the user chooses to view as the projection

Example:
  _Schemas_ single dimensional ranking viewed from _Predicate Partitioning_ by pivoting over the _Storage_ dimension

In [14]:
from Rank import SDRank

# example of schema dimension plots
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('subject')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('orc')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('avro')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('csv')
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('st')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('vp')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('pt')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('orc')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('avro')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('csv')
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('st')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('vp')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('pt')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plot('subject')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('subject')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('orc')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('avro')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('csv')
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('st')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('vp')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('pt')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('orc')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('avro')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('csv')
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('st')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('vp')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('pt')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plot('subject')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('subject')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('orc')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('avro')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('csv')
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('st')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('vp')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('pt')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('orc')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('avro')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('csv')
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plot('parquet')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('st')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('vp')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('pt')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('extvp')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('wpt')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('horizontal')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('predicate')
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plot('subject')
C:\Users\satri\anaconda3\envs\RDF_BenchRankingLib\lib\site-packages\pandas\plotting\_matplotlib\core.py:386: RuntimeWarning:

More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

Out[14]:
<AxesSubplot:title={'center':'Storage SD Rank pivoting Schemas formats for Subject Partition'}>

Rank.SDRank.plotBox¶

SDRank.plotBox(q = None)¶

To show the distribution of our query runtime data, we use a box plot to compare the data across queries in the experiment. A box plot provides general information about the data at a glance.

This method returns a box plot showing the maximum and minimum runtimes of each individual query for a particular single dimensional ranking.

Parameters:
q : list
  An optional list of column names to plot, for a more focused box plot. i.e. ["Q1", "Q2", "Q3"] will output a box plot only for those queries

In [15]:
from Rank import SDRank

# Box plot example of all queries of schema ranking dimension
SDRank(config_watdiv, logs_watdiv, '100M', 'schemas').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'partition').plotBox()
SDRank(config_watdiv, logs_watdiv, '100M', 'storage').plotBox()
SDRank(config_watdiv, logs_watdiv, '250M', 'schemas').plotBox()
SDRank(config_watdiv, logs_watdiv, '250M', 'partition').plotBox()
SDRank(config_watdiv, logs_watdiv, '250M', 'storage').plotBox()
SDRank(config_watdiv, logs_watdiv, '500M', 'schemas').plotBox()
SDRank(config_watdiv, logs_watdiv, '500M', 'partition').plotBox()
SDRank(config_watdiv, logs_watdiv, '500M', 'storage').plotBox()

Multi Dimensional Ranking¶

Given the trade-offs exposed by the single dimensional ranking function, we propose an optimization technique that finds the non-dominated solutions (configuration combinations) by optimizing all dimensions at the same time, using the NSGA-II algorithm.
In this experiment, we provide two ways to use NSGA-II:

  • The first method, paretoAgg, operates on the single dimensional ranking criteria. It aims to maximize the performance of the three ranks together
  • The second method, paretoQ, applies the algorithm to the rank sets obtained by sorting each query's results individually. It aims to minimize the query runtimes of the ranked dimensions
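The core idea, extracting the non-dominated (first Pareto) front, can be sketched as follows; the configurations and scores below are made up for illustration and are not PAPyA's implementation:

```python
def dominates(a, b):
    """a dominates b if it is at least as good in every objective and
    strictly better in at least one (higher rank score = better here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Non-dominated subset: the first front of NSGA-II's sorting step."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (schemas, partition, storage) rank scores per configuration.
scores = {
    "pt.horizontal.parquet": (0.925, 0.45, 0.55),
    "st.subject.orc":        (0.550, 1.00, 0.95),
    "st.horizontal.parquet": (0.100, 0.00, 0.20),
}
front_points = pareto_front(list(scores.values()))
solutions = [c for c, s in scores.items() if s in front_points]
```

The first two configurations each win on different dimensions, so neither dominates the other; the third is dominated by both.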

PAPyA.Rank.MDRank¶

class Rank.MDRank(config_path, log_path, ranking_sets)¶

Parameters:
  config_path : str
  Specify the path to your configuration file. i.e. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log files. i.e. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice. i.e. dataset sizes (100M)

In [16]:
from Rank import MDRank

# examples of the MDRank class with the 100M, 250M, and 500M dataset sizes as ranking sets
multiDimensionRank_100M = MDRank(config_watdiv, logs_watdiv, '100M')
multiDimensionRank_250M = MDRank(config_watdiv, logs_watdiv, '250M')
multiDimensionRank_500M = MDRank(config_watdiv, logs_watdiv, '500M')

Rank.MDRank.paretoQ¶

MDRank.paretoQ()¶

This method returns a table of configuration solutions as well as the dominated ones, according to the Non-Dominated Sorting Genetic Algorithm II (NSGA-II), applied to minimize the query runtimes of the ranked dimensions.

In [17]:
multiDimensionRank_100M.paretoQ()
Out[17]:
Solution Dominated
0 pt.horizontal.parquet st.horizontal.orc
1 pt.subject.parquet st.horizontal.parquet
2 pt.horizontal.orc
3 pt.subject.orc
4 st.subject.orc
5 vp.horizontal.parquet
6 st.subject.parquet
7 vp.subject.orc
8 vp.subject.parquet
9 vp.horizontal.orc
In [18]:
multiDimensionRank_250M.paretoQ()
Out[18]:
Solution Dominated
0 pt.horizontal.parquet st.subject.parquet
1 pt.subject.orc st.horizontal.orc
2 pt.subject.parquet st.horizontal.parquet
3 pt.horizontal.orc
4 vp.horizontal.parquet
5 vp.subject.orc
6 vp.horizontal.orc
7 vp.subject.parquet
8 st.subject.orc
In [19]:
multiDimensionRank_500M.paretoQ()
Out[19]:
Solution Dominated
0 pt.subject.orc st.horizontal.orc
1 pt.subject.parquet st.subject.parquet
2 pt.horizontal.orc st.horizontal.parquet
3 pt.horizontal.parquet
4 vp.subject.orc
5 vp.horizontal.orc
6 vp.horizontal.parquet
7 vp.subject.parquet
8 st.subject.orc

Rank.MDRank.paretoAgg¶

MDRank.paretoAgg()¶

This method returns a table of configuration solutions as well as the dominated ones, according to the Non-Dominated Sorting Genetic Algorithm II (NSGA-II), applied on the single dimensional ranking criteria to maximize the performance of all rankings together.

In [20]:
multiDimensionRank_100M.paretoAgg()
Out[20]:
Solution Dominated
0 st.subject.orc vp.horizontal.parquet
1 pt.horizontal.parquet pt.subject.orc
2 pt.horizontal.orc st.subject.parquet
3 pt.subject.parquet vp.subject.orc
4 st.horizontal.orc
5 vp.subject.parquet
6 vp.horizontal.orc
7 st.horizontal.parquet
In [21]:
multiDimensionRank_250M.paretoAgg()
Out[21]:
Solution Dominated
0 pt.horizontal.parquet pt.subject.orc
1 st.subject.orc vp.horizontal.parquet
2 vp.subject.orc
3 pt.subject.parquet
4 pt.horizontal.orc
5 vp.horizontal.orc
6 st.subject.parquet
7 vp.subject.parquet
8 st.horizontal.orc
9 st.horizontal.parquet
In [22]:
multiDimensionRank_500M.paretoAgg()
Out[22]:
Solution Dominated
0 pt.subject.orc vp.horizontal.parquet
1 st.subject.orc vp.horizontal.orc
2 pt.subject.parquet pt.horizontal.parquet
3 vp.subject.orc vp.subject.parquet
4 pt.horizontal.orc st.horizontal.orc
5 st.subject.parquet
6 st.horizontal.parquet

Rank.MDRank.plot¶

MDRank.plot()¶

This method returns a plot of the multi dimensional ranking solutions according to paretoAgg, shown as shades of green projected into a three dimensional space.

In [23]:
multiDimensionRank_100M.plot()
multiDimensionRank_500M.plot()
(4, 3) (8, 3)
(5, 3) (7, 3)

Ranking Criteria Validation¶

This library provides two metrics for evaluating the goodness of the ranking criteria: conformance and coherence.

  • Conformance measures the adherence of the top-ranked configurations to the actual query-level positions of those configurations. We calculate conformance according to the equation below:
$$A(R^k) = 1 - \sum \limits _{i=1} ^{|Q|} \sum \limits _{j=1} ^{k} \frac {\bar{A}(i,j)}{|Q| \cdot k}$$

Consider a ranking $R_{s}$ whose top-3 configurations are $\{c_{1},c_{2},c_{3}\}$, and suppose this set overlaps with the bottom-3 configurations of query $|Q|$, $\{c_{4},c_{2},c_{5}\}$, in a single element: $c_{2}$ sits at the $59^{th}$ position out of 60.
Thus, $A(R^k) = 1 - \frac {1}{11 \cdot 3}$, when $k = 3$ and $|Q| = 11$.
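The worked example above can be reproduced with a small sketch (a hypothetical helper, not PAPyA's implementation), taking $\bar{A}(i,j) = 1$ when a configuration's query position exceeds the threshold $h$:

```python
def conformance(positions, k, h):
    """A(R^k) for the top-k configurations.

    positions[c] lists configuration c's position in each query's ranking;
    a cell is non-conforming when the position exceeds the threshold h.
    """
    num_queries = len(next(iter(positions.values())))
    bad = sum(pos > h for per_query in positions.values() for pos in per_query)
    return 1 - bad / (num_queries * k)

# Top-3 configurations over 11 queries; c2 falls to position 59 once.
pos = {
    "c1": [1] * 11,
    "c2": [2] * 10 + [59],
    "c3": [3] * 11,
}
print(round(conformance(pos, k=3, h=28), 4))  # 0.9697
```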

  • Coherence measures the agreement between two ranking sets that use the same ranking criteria across different experiments. We use Kendall's index to calculate coherence, which counts the number of (dis)agreements between two ranking sets:
$$K(R_{1}, R_{2}) = \sum \limits _{\{i,j\} \in P} \frac {\bar{K}_{i,j}(R_{1}, R_{2})}{|P|}$$

In this experiment, we take the rank sets to be the dataset sizes (e.g. 100M and 250M). Kendall's distance is computed between two rank sets $R_{1}$ and $R_{2}$, where $P$ is the set of unique pairs of distinct elements in the two sets. For instance, the $K$ index between $R_{1}=\{c_{1},c_{2},c_{3}\}$ and $R_{2}=\{c_{1},c_{2},c_{4}\}$ for 100M and 250M is 0.33, i.e., one disagreement out of three pair comparisons.
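For two rankings over the same items, Kendall's distance is the fraction of item pairs ranked in opposite relative order. A minimal sketch of that same-items case (the library's handling of items that appear in only one set may differ):

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Fraction of unique item pairs whose relative order disagrees
    between the two rankings (0 = identical order, 1 = reversed)."""
    pos1 = {c: i for i, c in enumerate(r1)}
    pos2 = {c: i for i, c in enumerate(r2)}
    pairs = list(combinations(r1, 2))  # assumes both rankings share items
    disagree = sum((pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
                   for a, b in pairs)
    return disagree / len(pairs)

# Swapping the last two of three items: one disagreement out of three pairs.
d = kendall_distance(["c1", "c2", "c3"], ["c1", "c3", "c2"])
```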

PAPyA.Ranker.Conformance¶

class Ranker.Conformance(config_path, log_path, ranking_sets, conformance_set, k, h)¶

Parameters:
  config_path : str
  Specify the path to your configuration file. i.e. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log files. i.e. ./UI Module/log_watdiv
  ranking_sets : str
  Ranking set of the user's choice. i.e. dataset sizes (100M)
  conformance_set : list
  List of ranking criteria whose conformance scores will be computed.
  k : int
  Size of the top-k subset of the ranking criteria. i.e. _k_ = 5 takes the top 5 configurations of each ranking criterion in the conformance set
  h : int
  Threshold counted towards the conformance score. i.e. _h_ = 28 counts queries where a configuration ranks below position 28 out of all configurations

In [24]:
from Ranker import Conformance

conformance_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
conf_100M = Conformance(config_watdiv, logs_watdiv, '100M', conformance_set, 2, 7)
conf_250M = Conformance(config_watdiv, logs_watdiv, '250M', conformance_set, 2, 7)
conf_500M = Conformance(config_watdiv, logs_watdiv, '500M', conformance_set, 2, 7)

Ranker.Conformance.run¶

Conformance.run()¶

This method returns a table of conformance scores for each ranking criterion the user specified in the conformance_set, computed with the k and h values supplied above.

In [25]:
# conformance scores for all ranking criteria at the 100M dataset size
conf_100M.run()
Out[25]:
100M
schemas 0.950
partition 0.675
storage 0.625
paretoQ 0.925
paretoAgg 0.900
In [26]:
conf_250M.run()
Out[26]:
250M
schemas 0.950
partition 0.275
storage 0.225
paretoQ 0.950
paretoAgg 0.700
In [27]:
conf_500M.run()
Out[27]:
500M
schemas 0.950
partition 0.075
storage 0.075
paretoQ 0.975
paretoAgg 0.575

Ranker.Conformance.configurationQueryRanks¶

Conformance.configurationQueryRanks(dimension, mode)¶

This method returns the criteria table of a chosen ranking dimension: a table showing the rank value of each query for the top-k configurations.

Parameters:
dimension : str
  The dimension whose criteria table to view

  mode : 0 (default) or 1
  '0' shows the criteria table with raw rank values
  '1' shows the criteria table as True/False against the threshold (True if the rank value is higher than h)
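The mode=1 view can be mimicked by thresholding a raw criteria table. This is a sketch only; the fragment below reuses a few values from the 100M schema criteria table with h = 7:

```python
def threshold_view(criteria_table, h):
    """mode=1: True where a query's rank value exceeds the threshold h."""
    return {config: {q: rank > h for q, rank in row.items()}
            for config, row in criteria_table.items()}

# A fragment of the 100M schema criteria table (queries 1, 11, 12).
table = {"pt.horizontal.orc": {"1": 7, "11": 12, "12": 10}}
flags = threshold_view(table, h=7)
# Queries 11 and 12 exceed h = 7; query 1 does not.
```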

In [28]:
conf_100M.configurationQueryRanks(dimension = 'schemas', mode = 0)
Out[28]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.horizontal.parquet 3 7 3 4 1 2 1 2 4 7 2 6 1 6 2 4 2 4 2 2
pt.horizontal.orc 7 5 2 3 2 1 4 1 7 1 12 10 5 7 1 3 3 1 5 4
In [29]:
conf_100M.configurationQueryRanks(dimension = 'partition', mode = 0)
Out[29]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc 11 6 7 5 5 3 5 4 9 8 3 3 8 1 5 1 5 7 4 3
st.subject.parquet 8 8 8 10 6 7 8 7 10 9 4 7 10 2 7 6 6 10 6 6
In [30]:
conf_100M.configurationQueryRanks(dimension = 'storage', mode = 0)
Out[30]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc 11 6 7 5 5 3 5 4 9 8 3 3 8 1 5 1 5 7 4 3
st.horizontal.orc 12 12 12 7 8 5 7 6 12 10 5 8 11 3 8 7 11 11 7 7
In [31]:
conf_250M.configurationQueryRanks(dimension = 'schemas', mode = 0)
Out[31]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.horizontal.parquet 5 5 2 1 4 1 4 1 2 1 4 7 1 2 1 3 3 1 2 2
pt.horizontal.orc 7 7 3 2 3 3 3 3 4 8 5 9 4 1 2 5 2 2 1 3
In [32]:
conf_250M.configurationQueryRanks(dimension = 'partition', mode = 0)
Out[32]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc 10 9 5 7 9 6 9 8 9 9 9 8 9 3 5 1 5 5 9 9
st.subject.parquet 9 10 6 10 10 10 10 10 10 10 10 10 11 7 9 7 9 8 11 10
In [33]:
conf_250M.configurationQueryRanks(dimension = 'storage', mode = 0)
Out[33]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc 10 9 5 7 9 6 9 8 9 9 9 8 9 3 5 1 5 5 9 9
st.horizontal.orc 12 12 10 11 11 11 11 11 11 11 11 11 10 6 11 9 10 11 10 11
In [34]:
conf_500M.configurationQueryRanks(dimension = 'schemas', mode = 0)
Out[34]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.subject.parquet 6 6 2 3 2 3 1 2 2 3 3 5 3 1 4 9 6 3 2 2
pt.horizontal.parquet 7 4 6 4 4 6 4 1 6 7 5 8 4 4 1 2 2 4 4 4
In [35]:
conf_500M.configurationQueryRanks(dimension = 'partition', mode = 0)
Out[35]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc 9 9 4 9 9 9 9 9 9 9 9 9 9 9 7 5 9 9 9 9
st.subject.parquet 10 11 10 11 11 11 11 11 10 11 11 11 11 11 11 11 11 11 11 11
In [36]:
conf_500M.configurationQueryRanks(dimension = 'storage', mode = 0)
Out[36]:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.horizontal.orc 11 10 11 10 10 10 10 10 11 10 10 10 10 10 10 10 10 10 10 10
st.subject.orc 9 9 4 9 9 9 9 9 9 9 9 9 9 9 7 5 9 9 9 9
In [37]:
conf_100M.configurationQueryRanks(dimension = 'schemas', mode = 1)
Out[37]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.horizontal.parquet False False False False False False False False False False False False False False False False False False False False
pt.horizontal.orc False False False False False False False False False False True True False False False False False False False False
In [38]:
conf_100M.configurationQueryRanks(dimension = 'partition', mode = 1)
Out[38]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True False False False False False False False True True False False True False False False False False False False
st.subject.parquet True True True True False False True False True True False False True False False False False True False False
In [39]:
conf_100M.configurationQueryRanks(dimension = 'storage', mode = 1)
Out[39]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True False False False False False False False True True False False True False False False False False False False
st.horizontal.orc True True True False True False False False True True False True True False True False True True False False
In [40]:
conf_250M.configurationQueryRanks(dimension = 'schemas', mode = 1)
Out[40]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.horizontal.parquet False False False False False False False False False False False False False False False False False False False False
pt.horizontal.orc False False False False False False False False False True False True False False False False False False False False
In [41]:
conf_250M.configurationQueryRanks(dimension = 'partition', mode = 1)
Out[41]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True True False False True False True True True True True True True False False False False False True True
st.subject.parquet True True False True True True True True True True True True True False True False True True True True
In [42]:
conf_250M.configurationQueryRanks(dimension = 'storage', mode = 1)
Out[42]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True True False False True False True True True True True True True False False False False False True True
st.horizontal.orc True True True True True True True True True True True True True False True True True True True True
In [43]:
conf_500M.configurationQueryRanks(dimension = 'schemas', mode = 1)
Out[43]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
pt.subject.parquet False False False False False False False False False False False False False False False True False False False False
pt.horizontal.parquet False False False False False False False False False False False True False False False False False False False False
In [44]:
conf_500M.configurationQueryRanks(dimension = 'partition', mode = 1)
Out[44]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True True False True True True True True True True True True True True False False True True True True
st.subject.parquet True True True True True True True True True True True True True True True True True True True True
In [45]:
conf_500M.configurationQueryRanks(dimension = 'storage', mode = 1)
Out[45]:
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
st.subject.orc True True False True True True True True True True True True True True False False True True True True
st.horizontal.orc True True True True True True True True True True True True True True True True True True True True

PAPyA.Ranker.Coherence¶

class Ranker.Coherence(config_path, log_path, conformance_set)¶

Parameters:
  config_path : str
  Specify the path to your configuration file. i.e. ./UIModule/settings_watdiv.yaml
  log_path : str
  Specify the path to your log files. i.e. ./UI Module/log_watdiv
  conformance_set : list
  List of ranking criteria whose Kendall index scores between ranking sets (i.e. dataset sizes) will be computed.

In [46]:
from Ranker import Coherence

coherence_set = ['schemas', 'partition', 'storage', 'paretoQ', 'paretoAgg']
coh = Coherence(config_watdiv, logs_watdiv,coherence_set)

Ranker.Coherence.run¶

Coherence.run(rankset1, rankset2)¶

This method returns a table of coherence scores for each ranking criterion specified in the conformance_set, calculated from the number of (dis)agreements between two ranking sets.

Parameters:
rankset1 : str
  The _first_ rankset the user wants to compare

  rankset2 : str
  The _second_ rankset the user wants to compare

In [47]:
# example of coherence scores for all ranking criteria, comparing the 100M ranking set with the 250M ranking set
coh.run('100M', '250M')
Out[47]:
Kendall's Index
schemas 0.227273
partition 0.121212
storage 0.439394
paretoQ 0.333333
paretoAgg 0.500000
In [48]:
coh.run('100M', '500M')
Out[48]:
Kendall's Index
schemas 0.212121
partition 0.378788
storage 0.484848
paretoQ 0.444444
paretoAgg 0.933333
In [49]:
coh.run('250M', '500M')
Out[49]:
Kendall's Index
schemas 0.106061
partition 0.469697
storage 0.378788
paretoQ 0.333333
paretoAgg 0.600000

Ranker.Coherence.heatMap¶

Coherence.heatMap(rankset1, rankset2, dimension)¶

This method returns a heat map showing the coherence between two ranking sets chosen by the user. The heat map is sorted by the best performing configurations of the first ranking set.

Parameters:
rankset1 : str
  The _first_ rankset the user wants to compare

  rankset2 : str
  The _second_ rankset the user wants to compare

  dimension : str
  The dimension whose heat map to view
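The data behind such a heat map can be sketched as a position matrix with rows ordered by the first ranking set; the helper and the rankings below are hypothetical, for illustration only:

```python
def heatmap_matrix(rankset1, rankset2):
    """Rows = configurations ordered by the first ranking set; each row
    holds the configuration's 1-based position in both sets."""
    pos2 = {c: i + 1 for i, c in enumerate(rankset2)}
    return [(c, i + 1, pos2.get(c)) for i, c in enumerate(rankset1)]

# Hypothetical schema orderings for two ranking sets.
m = heatmap_matrix(["pt", "st", "vp"], ["pt", "vp", "st"])
# pt stays 1st; st drops from 2nd to 3rd; vp rises from 3rd to 2nd.
```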

In [50]:
coh.heatMap('100M', "250M", dimension='schemas')
In [51]:
coh.heatMap('100M', "250M", dimension='partition')
In [52]:
coh.heatMap('100M', "250M", dimension='storage')
In [61]:
coh.heatMap('100M', "250M", dimension='paretoAgg')
In [54]:
coh.heatMap('100M', "250M", dimension='paretoQ')

Ranker.Coherence.heatMapSubtract¶

Coherence.heatMapSubtract(*args, dimension)¶

This method shows the coherence differences between ranking sets chosen by the user, with the first ranking set as the pivot point (100M-250M and 100M-500M are shown in the example). The heat map is sorted by the best performing configurations of the first ranking set.

Parameters:
*args : str
  Takes an arbitrary number of ranking sets; the first ranking set is the pivot point for all the others

  dimension : str
  The dimension whose heat map to view

In [55]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='schemas')
In [56]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='partition')
In [57]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='storage')
In [58]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='paretoAgg')
In [59]:
coh.heatMapSubtract('100M', '250M', '500M', dimension='paretoQ')
In [ ]: